Python Collections (Sets & Sequences)

  • Python also has built-in types that are compound, meaning that they group together multiple objects.
    • These are called collections or containers.
    • The objects in a collection are generally referred to as items or elements.
  • Collections can hold any number of items and even contain items with a mixture of types

There are three categoris of built-in collection types in Python:

  1. Sets
  2. Sequences
  3. Mappings

  • Streams are another kind of collection whose elements form a series in time, rather than the computer’s memory space.
  • Modules such as array and collections provide additional collection types
  • The distinguishing feature of the collection categories is how individual elements are accessed:
    • Sets don't allow individual access
    • Sequences use numeric indexes
    • mappings use keys
    • streams just provide one address - "next"
  • Some types are immutable - once created they cannot be changed i.e. elements cannot be added, removed or reordered.
  • Primitive types are immutable
  • Mutable built-in collections cannot be added to sets or used as mapping keys.
  • Certain basic operations and functions are applicable to all collection types

Sets

  • A set is an unordered collection of items that contains no duplicates. (Mutable)
  • The type frozenset is an immutable version of set.
  • Sets can be created by passing a string as the argument to call to set.
  • Sets can also be created syntactically by enclosing comma-separated values in curly braces ({}). This construction is called a set display.
In [1]:
set('TCAGTTAT')
Out[1]:
{'A', 'C', 'G', 'T'}
In [46]:
DNABases = {'T', 'C', 'A', 'G'}
In [3]:
RNABases = {'U', 'C', 'A', 'G'}
In [48]:
DNABases
Out[48]:
{'A', 'C', 'G', 'T'}
In [5]:
RNABases
Out[5]:
{'A', 'C', 'G', 'U'}
In [6]:
{'TCAG'}
Out[6]:
{'TCAG'}
In [7]:
{'TCAG', 'UCAG'}
Out[7]:
{'TCAG', 'UCAG'}
In [49]:
{'AATTGC'}
Out[49]:
{'AATTGC'}
In [50]:
set('AATTGC')
Out[50]:
{'A', 'C', 'G', 'T'}

In [53]:
# Rewrite validate_base_sequence using sets
DNAbases = set('TCAGtcag')
RNAbases = set('UCAGucag')
def validate_base_sequence(base_sequence, RNAflag = False):
    """Return True if the string base_sequence contains only upper- or lowercase
    T (or U, if RNAflag), C, A, and G characters, otherwise False"""
    return set(base_sequence) <= (RNAbases if RNAflag else DNAbases)
In [55]:
validate_base_sequence('tattattat',True)
Out[55]:
False
In [51]:
set('tattattat')
Out[51]:
{'a', 't'}
In [56]:
RNAbases
Out[56]:
{'A', 'C', 'G', 'U', 'a', 'c', 'g', 'u'}
In [3]:
validate_base_sequence('atgcwrqatgc')
Out[3]:
False

Sequences

  • Sequences are ordered collections that may contain duplicate elements. (Mutable)
  • Their elements can be referenced by position.

  • There are six built-in sequence types:


  • All sequence types except for ranges and files support the methods count and index .
  • In addition, all sequence types support the reversed function, which returns a special object that produces the elements of the sequence in reverse order.

Strings, Bytes & Bytearrays

  • Strings are sequences of Unicode characters (though there is no "character" type).
    • Unicode is an international standard that assigns a unique number to every character of all major languages (including special "languages" such as musical notation and mathematics), as well as special characters such as diacritics and technical symbols.
  • The bytes and bytearray types are sequences of single bytes.
  • bytes and bytearrays have same function as strings but an important difference is -
    • Slicing strings returns strings while specifying an element or slice whereas bytes and bytearrays resturns same element while slicing but indexing and element resturns an integer from 0 through 255.
    • The bytes and bytearray types serve many purposes, one of which is to efficiently store and manipulate raw data.

Creating

  • str() - returns empty string
  • str(obj) - Returns a printable representation of obj , as specified by the definition of the type of obj
  • chr(n) - Returns the one-character string corresponding to the integer n in the Unicode system
  • ord(char) - Returns the Unicode number corresponding to the one-character string char
In [4]:
str()
Out[4]:
''
In [5]:
str(4)
Out[5]:
'4'
In [1]:
chr(105)
Out[1]:
'i'
In [8]:
ord('D')
Out[8]:
68

Testing

  • str1.isalpha() - Returns true if str1 is not empty and all of its characters are alphabetic
  • str1.isdigit() - Returns true if str1 is not empty and all of its characters are digits
  • str1.islower() - Returns true if str1 contains at least one "cased" character and all of its cased characters are lowercase
  • str1.isupper() - Returns true if str1 contains at least one "cased" character and all of its cased characters are uppercase

Searching

  • str1.startswith(str2[, startpos, [endpos]]) - Returns true if str1 starts with str2
  • str1.endswith(str2[, startpos, [endos]]) - Returns true if str1 ends with str2
  • str1.find(str2[, startpos[, endpos]]) - Returns the lowest index of str1 at which str2 is found, or -1 if it is not found
  • str1.index(str2[, startpos[, endpos]]) - Returns the lowest index of str1 at which str2 is found, or ValueError if it is not found
  • str1.count(str2[, startpos[, endpos]]) - Returns the number of occurrences of str2 in str1

Replacing

  • str1.replace(oldstr, newstr[, count]) - Returns a copy of str1 with all occurrences of the substring oldstr replaced by the string newstr ; if count is specified, only the first count occurrences are replaced.

Changing case

  • str1.lower() - Returns a copy of the string with all of its characters converted to lowercase
  • str1.upper() - Returns a copy of the string with all of its characters converted to uppercase
  • str1.capitalize() - Returns a copy of the string with only its first character capitalized; has no effect if the first character is not a letter (e.g., if it is a space)
  • str1.title() - Returns a copy of the string with each word beginning with an uppercase character and the rest lowercase
  • str1.swapcase() - Returns a copy of the string with lowercase characters made uppercase and vice versa

Reformating

  • str1.lstrip([chars]) - Returns a copy of str1 with leading characters removed.
  • str1.rstrip([chars]) - Returns a copy of str1 with trailing characters removed.
  • str1.strip([chars]) - Returns a copy of str1 with leading and trailing characters removed.
  • str1.ljust(width[, fillchar]) - Returns str1 left-justified in a new string of length width , "padded" with fillchar (the default fill character is a space).
  • str1.rjust(width[, fillchar]) - Returns str1 right-justified in a new string of length width , "padded" with fillchar (the default fill character is a space).
  • str1.center(width[, fillchar]) - Returns str1 centered in a new string of length width , "padded" with fillchar (the default fill character is a space).

format method

  • format(value[, format-specification]) - Returns a string obtained by formatting value according to the formatspecification ; if no format-specification is provided, this is equivalent to str(value).
  • format-specification.format(posargs, ..., kwdargs, ...) - Returns a string formatted according to the format-specification; any number of positional arguments may be followed by any number of keyword arguments
In [1]:
str1 = 'a string'
In [2]:
'"{0}" contains {1} characters'.format(str1, len(str1))
Out[2]:
'"a string" contains 8 characters'
In [3]:
'"{}" contains {} characters'.format(str1, len(str1))
Out[3]:
'"a string" contains 8 characters'
In [4]:
'"{string}" contains {length} characters'.format(string=str1, length=len(str1))
Out[4]:
'"a string" contains 8 characters'
In [5]:
'"{string}" contains {length} characters'.format(length=len(str1),string=str1)
Out[5]:
'"a string" contains 8 characters'

Ranges

A range represents a series of integers.

  • range(stop) - creates a range representing the integers from 0 up to but not including stop .
  • range(start, stop) - creates a range representing the integers from start up to but not including stop.
  • range(start, stop,step) - creates a range representing the integers from start up to but not including stop , in increments of step .
In [6]:
range(5)
Out[6]:
range(0, 5)
In [7]:
set(range(5))
Out[7]:
{0, 1, 2, 3, 4}
In [8]:
set(range(5, 10))
Out[8]:
{5, 6, 7, 8, 9}
In [9]:
set(range(5, 10, 2))
Out[9]:
{5, 7, 9}
In [11]:
set(range(15, 10, -2))
Out[11]:
{11, 13, 15}
In [13]:
set(range(0, -25, -5))
Out[13]:
{-20, -15, -10, -5, 0}

Tuples

  • A tuple is an immutable sequence that can contain any type of element.
  • Tuples are written as comma-separated series of items surrounded by parentheses.
  • one-element tuples must be written with a comma after their single element.
  • An empty pair of parentheses creates an empty tuple.
  • Given a sequence as an argument, the tuple function creates a tuple containing the elements of the sequence.
In [14]:
('TCAG', 'UCAG') # 2 element tuple
Out[14]:
('TCAG', 'UCAG')
In [15]:
('TCAG',) # 1 element tuple
Out[15]:
('TCAG',)
In [16]:
() # empty tuple
Out[16]:
()
In [17]:
('TCAG') # Not a tuple!
Out[17]:
'TCAG'
In [18]:
tuple('TCAG')
Out[18]:
('T', 'C', 'A', 'G')
In [19]:
tuple(range(5,10))
Out[19]:
(5, 6, 7, 8, 9)

Tuple packing & unpacking

  • The righthand side of an assignment statement can be a series of comma-separated expressions. The result is a tuple with those values. This is called tuple packing
In [20]:
bases = 'TCAG', 'UCAG'
In [21]:
bases
Out[21]:
('TCAG', 'UCAG')
In [22]:
DNABases, RNABases = 'TCAG', 'UCAG'
In [23]:
DNABases
Out[23]:
'TCAG'
In [24]:
RNABases
Out[24]:
'UCAG'
In [25]:
def recognition_site(base_seq, recognition_seq):
    return base_seq.find(recognition_seq)
In [27]:
def restriction_cut(base_seq, recognition_seq, offset = 0):
    """Return a pair of sequences derived from base_seq by splitting it at the first appearance
    of recognition_seq; offset, which may be negative, is the number of bases relative to the
    beginning of the site where the sequence is cut"""
    site = recognition_site(base_seq, recognition_seq)
    return base_seq[:site+offset], base_seq[site+offset:]
In [28]:
aseq1 = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'
In [29]:
left, right = restriction_cut(aseq1, 'TCCGGA')
In [30]:
left
Out[30]:
'AAAAATCCCGAGGCGGCTATATAGGGC'
In [31]:
right
Out[31]:
'TCCGGAGGCGTAATATAAAA'
In [32]:
a, b = 4, 2
In [33]:
a
Out[33]:
4
In [34]:
b
Out[34]:
2
In [35]:
a, b = b, a
In [36]:
a
Out[36]:
2
In [37]:
b
Out[37]:
4

Lists

  • A list is a mutable sequence of any kind of element.
  • The syntax for lists is a comma-separated series of values enclosed in square brackets.


Deleting

  • del lst[n] - remove the nth element from lst
  • del lst[i:j] - remove the ith through jth elements from lst
  • del lst[i:j:k] - remove every k elements from i up to j from lst

In [39]:
list1 = [1,2,3]
In [40]:
list2 = [4,5]
In [42]:
list1 + list2 # Concatenation, list1 & list2 remain unchanged
Out[42]:
[1, 2, 3, 4, 5]
In [43]:
list1.extend(list2)
In [44]:
list1
Out[44]:
[1, 2, 3, 4, 5]
In [45]:
list2
Out[45]:
[4, 5]

Sequence-oriented string methods

  • string.splitlines([keepflg]) - Returns a list of the “lines” in string , splitting at end-of-line characters. If keepflg is omitted or is false, the end-of-line characters are not included in the lines; otherwise, they are.
  • string.split([sepr[, maxwords]]) - Returns a list of the “words” in string , using sepr as a word delineator. In the special case where sepr is omitted or is None , words are delineated by any consecutive whitespace characters; if maxwords is specified the result will have at most maxwords +1 elements.
  • string.rsplit([sepr[, maxwords]]) - Performs a reverse split: same as split except that if maxwords is specified and its value is less than the number of words in string the result returned is a list containing the last maxwords+1 words.
  • sepr.join(seq) - Returns a string formed by concatenating the strings in seq separated by sepr, which can be any string (including the empty string).
  • string.partition(sepr) - Returns a tuple with three elements: the portion of string up to the first occurrence of sepr , sepr , and the portion of string after the first occurrence of sepr . If sepr is not found in string , the tuple is (string, '', '').
  • string.rpartition(sepr) - Returns a tuple with three elements: the portion of string up to the last occurrence of sepr , sepr , and the portion of string after the last occurrence of sepr . If sepr is not found in string , the tuple is ('', '', string)
In [4]:
a = [ 'a', 'b', 'c', 'd', 'e']
In [7]:
abc = ','.join(a)
In [8]:
abc
Out[8]:
'a,b,c,d,e'
In [9]:
abc.split()
Out[9]:
['a,b,c,d,e']
In [10]:
abc.split(',')
Out[10]:
['a', 'b', 'c', 'd', 'e']